{ "cells": [ { "cell_type": "markdown", "id": "f4ccfb71", "metadata": {}, "source": [ "# Parallelizing Multiple Grid Cells on a High-Performance Computing Cluster" ] }, { "cell_type": "markdown", "id": "1f35a372", "metadata": {}, "source": [ "Building off the previous tutorial, this notebook demonstrates how to use the [Dask PBSCluster](https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.PBSCluster.html) object to parallelize MUSICA simulations across HPC systems. While MUSICA is locally efficient for simulations up to around 10,000 grid cells, scaling to larger domains can benefit from parallelizing grid cell calculations to improve runtime performance. This tutorial walks through best practices for setting up and running parallel MUSICA workflows, with concrete examples and reference scaling tests performed on [NCAR’s Casper HPC system](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/casper/)." ] }, { "cell_type": "markdown", "id": "3adde55e", "metadata": {}, "source": [ "## 1. Importing Libraries" ] }, { "cell_type": "markdown", "id": "35845be0", "metadata": {}, "source": [ "In addition to libraries previously used throughout MUSICA tutorials, this tutorial uses Dask. Note that if you'd like to visualize Dask graphs, you will also need [Graphviz](https://graphviz.org) installed." ] }, { "cell_type": "code", "execution_count": null, "id": "47021626", "metadata": {}, "outputs": [], "source": [ "#import libraries\n", "import musica\n", "import musica.mechanism_configuration as mc\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "from dask import delayed, compute\n", "from dask.distributed import Client\n", "from dask_jobqueue import PBSCluster\n", "import numpy as np\n", "import time\n", "from scipy.stats import qmc\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "id": "49c68e4d-d477-447e-83fc-08008147e478", "metadata": {}, "source": [ "## 2. Dask Cluster Set up" ] }, { "cell_type": "markdown", "id": "95d2d7b8", "metadata": {}, "source": [ "We first need to set up a Dask Cluster object. In this tutorial, we use the PBSCluster class since our HPC system uses a PBS-based scheduler, but Dask also provides an equivalent [SLURMCluster](https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html) class for SLURM-based systems. When initializing the cluster, you’ll notice it accepts many arguments that may look familiar from your system’s job scripts such as requested memory, number of cores, and walltime. Keep in mind that these values should be adapted to fit your particular workload and system constraints. As a general guideline, we find MultiGrid cell simulations under ~10,000 cells to run comfortably with about 8 GB of memory, while larger simulations ranging from 100,000 to 1,000,000 grid cells may need closer to 15 GB to run efficiently. It can be helpful to initialize the PBSCluster with extra memory as done below (10 GB)." 
] }, { "cell_type": "code", "execution_count": null, "id": "edc588b5-6a65-48f2-b3a6-437bcdf066c5", "metadata": {}, "outputs": [], "source": [ "#casper\n", "cluster = PBSCluster(\n", " job_name = 'dask-test',\n", " cores = 1,\n", " memory = '10GiB',\n", " processes = 1,\n", " local_directory = '/local_scratch/pbs.$PBS_JOBID/dask/spill',\n", " resource_spec = 'select=1:ncpus=1:mem=10GB', #memory and resource especially memory should match\n", " queue = 'casper',\n", " walltime = '50:00',\n", " interface = 'ext'\n", ")" ] }, { "cell_type": "markdown", "id": "bb06180d", "metadata": {}, "source": [ "Prior to running your simulation, it can be helfpul to check the active Dask PBS configuration and resulting job scrip that will be used for your Dask workers." ] }, { "cell_type": "code", "execution_count": null, "id": "8b7564e9-2401-489d-a09e-9915a26ddf26", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'name': 'dask-worker',\n", " 'cores': None,\n", " 'memory': None,\n", " 'processes': None,\n", " 'python': None,\n", " 'interface': None,\n", " 'death-timeout': 60,\n", " 'local-directory': None,\n", " 'shared-temp-directory': None,\n", " 'extra': None,\n", " 'worker-command': 'distributed.cli.dask_worker',\n", " 'worker-extra-args': [],\n", " 'shebang': '#!/usr/bin/env bash',\n", " 'queue': None,\n", " 'account': None,\n", " 'walltime': '00:30:00',\n", " 'env-extra': None,\n", " 'job-script-prologue': [],\n", " 'resource-spec': None,\n", " 'job-extra': None,\n", " 'job-extra-directives': [],\n", " 'job-directives-skip': [],\n", " 'log-directory': None,\n", " 'scheduler-options': {}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#view configuration file(s) within Python\n", "from dask import config\n", "config.refresh()\n", "config.get('jobqueue.pbs')" ] }, { "cell_type": "code", "execution_count": 4, "id": "5a02e45d-7e69-4cc6-8ffb-47f49d52a2d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#!/usr/bin/env bash\n", "\n", "#PBS -N dask-test\n", "#PBS -q casper\n", "#PBS -A NTDD0005\n", "#PBS -l select=1:ncpus=1:mem=10GB\n", "#PBS -l walltime=50:00\n", "\n", "/glade/work/apak/conda-envs/musicbox/bin/python -m distributed.cli.dask_worker tcp://128.117.208.119:36453 --name dummy-name --nthreads 1 --memory-limit 10.00GiB --nanny --death-timeout 60 --local-directory /local_scratch/pbs.$PBS_JOBID/dask/spill --interface ext\n", "\n" ] } ], "source": [ "print(cluster.job_script())" ] }, { "cell_type": "markdown", "id": "4bd22360", "metadata": {}, "source": [ "As in the [previous tutorial](4.%20local_parallelization.ipynb), the Dask Cluster provides a convenient and interactive dashboard to visualize your parallelization work." ] }, { "cell_type": "code", "execution_count": 5, "id": "9d16e1f9-608c-4c5e-84ca-83a118886299", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[Dask Client summary: Connection method: Cluster object | Cluster type: dask_jobqueue.PBSCluster | Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/apak/proxy/8787/status | Scheduler: tcp://128.117.208.119:36453 | Workers: 0 | Total threads: 0 | Total memory: 0 B]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ] },
\n", " | time.s | \n", "ENV.temperature.K | \n", "ENV.pressure.Pa | \n", "ENV.air number density.mol m-3 | \n", "CONC.A.mol m-3 | \n", "CONC.B.mol m-3 | \n", "CONC.C.mol m-3 | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "0 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "9.768133 | \n", "7.142146 | \n", "8.698319 | \n", "
1 | \n", "0 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "5.780753 | \n", "9.547887 | \n", "8.133570 | \n", "
2 | \n", "0 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "4.183914 | \n", "0.864572 | \n", "0.116974 | \n", "
3 | \n", "0 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.358964 | \n", "8.126227 | \n", "6.105201 | \n", "
4 | \n", "0 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "5.233886 | \n", "4.299744 | \n", "6.196699 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
609995 | \n", "60 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.001229 | \n", "0.011452 | \n", "12.930998 | \n", "
609996 | \n", "60 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.000421 | \n", "0.004933 | \n", "18.148737 | \n", "
609997 | \n", "60 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.000106 | \n", "0.002043 | \n", "6.835295 | \n", "
609998 | \n", "60 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.000370 | \n", "0.004365 | \n", "13.947568 | \n", "
609999 | \n", "60 | \n", "301.626773 | \n", "101259.516561 | \n", "40.376789 | \n", "0.000526 | \n", "0.004779 | \n", "12.959798 | \n", "
610000 rows × 7 columns
\n", "